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Abstract 

The Bayesian approach to Inverse Problems relies predominantly on Markov 
Chain Monte Carlo methods for posterior inference. The typical nonlinear con¬ 
centration of posterior measure observed in many such Inverse Problems presents 
severe challenges to existing simulation based inference methods. Motivated by 
these challenges the exploitation of local geometric information in the form of 
covariant gradients, metric tensors, Levi-Civita connections, and local geodesic 
flows, have been introduced to more effectively locally explore the configuration 
space of the posterior measure. However, obtaining such geometric quantities 
usually requires extensive computational effort and despite their effectiveness 
affect the applicability of these geometrically-based Monte Carlo methods. In 
this paper we explore one way to address this issue by the construction of an em¬ 
ulator of the model from which all geometric objects can be obtained in a much 
more computationally feasible manner. The main concept is to approximate the 
geometric quantities using a Caussian Process emulator which is conditioned on 
a carefully chosen design set of configuration points, which also determines the 
quality of the emulator. To this end we propose the use of statistical experi¬ 
ment design methods to refine a potentially arbitrarily initialized design online 
without destroying the convergence of the resulting Markov chain to the desired 
invariant measure. The practical examples considered in this paper provide a 
demonstration of the significant improvement possible in terms of computational 
loading suggesting this is a promising avenue of further development. 
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1. Introduction 


In Bayesian Inverse Problems one needs to draw samples from a typically 
high dimensional and complicated intractable probability measure [T]. Samples 
are needed to estimate integrals for e.g. point estimates or interval estimates for 
uncertainty quantification. Random Walk Metropolis (RWM) is hampered with 
its random walk nature, and Hybrid Monte Carlo (HMC)[21 El IH El El IZl El E] 
with its exploitation of local gradients and approximate Hamiltonian flows in 
an expanded phase space can substantially improve over RWM. Riemannian 
Manifold Hamiltonian Monte Carlo [5] further takes advantage of local metric 
tensors to adapt the transition kernel of the Markov chain to the local structure 
of the probability measure, and indeed the proposal mechanism is provided by 
the local geodesic flows on the manifold of probability measures [S]. This has 
been demonstrated to allow Markov Chain Monte Carlo (MCMC) to effectively 
explore the types of challenging posterior measures observed in many Inverse 
Problems, see e.g. [10] and the example in Figure in this paper. 

The challenge here is that these geometric objects including gradients, met¬ 
rics, connection components are typically expensive to compute, hindering their 
application in practice. This is due to the requirement of a single forward solve 
of the model in evaluating the likelihood, and this increases with the choice of 
metric tensor and associated connections (second and third order tensors), see 
[To] for detailed developments which exploit adjoint solver codes. 

In this contribution we investigate the feasibility of emulating these expen¬ 
sive geometric quantities using a Gaussian Process model m- The remainder 
of the paper has the following structure. Section 2 briefly reviews Hamiltonian 
Monte Carlo methods. Sections 3 and 4 detail the Gaussian Process emulation 
of potential energies, gradients, second order metric tensors and third order ten¬ 
sor metric connections. Since it is impossible to emulate the expected Fisher 
metric |S] based on the Gaussian Process assumption, we propose to emulate 
the empirical Fisher information in this work. The accuracy of the GP emu¬ 
lator to approximate these geometric quantities depends on the design set, or 
conhgurations, which should be well spread over the distribution to capture its 
geometry. However it is not reasonable to assume such a good design set is 
available initially. Therefore section 5 introduces regeneration miiiiii] as 
a general adaptation framework and experimental design algorithm Mutual In¬ 
formation for Computer Experiments (MICE) [I5j to refine the design set. We 
illustrate the advantage of emulation for geometric Monte Carlo algorithms over 
their full versions with examples in section 6. Finally in section 7, we summarise 
the contribution and discusse some future directions of investigation. 

2. Review of Dynamics and Geometry Inspired Simulation Methods 
2.1. HMC 

Hybrid Monte Carlo (HMC) jS] E] is a Metropolis style algorithm that 
reduces its random walk behaviour by making distant proposals guided by 
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Hamiltonian flows. These distant proposals are found by numerically simu¬ 
lating Hamiltonian dynamics, whose state space consists of its position vector, 
6 G and its momentum vector, p G In application to statistical mod¬ 
els, 6 consist of the model parameters (and perhaps latent variables), and p are 
auxiliary variables. The objective is to sample from the posterior distribution 
tt{9\V) oc 7r(0)L(0|I?), where 7r(0) is the prior and is the likelihood func¬ 

tion. We define the potential energy as U(6) := — log7r(0|I?), and the kinetic 
energy^ K{p), similarly as the minus log of the density of p, which is usually 
assumed p ~ A/'(0, M). Then the total energy, Hamiltonian function is defined 
as their sum: 

H{e, p) = u{e) + K{p) = - log 7T{e\v) + ^p^m-^p (i) 

Therefore the joint density of 6 and p is 7r(0,p) cx exp(—7J(0, p)). Note, the 
covariance matrix M, also referred as the constant mass matrix. 

Given the current state 0, we sample the momentum p ~ A/'(0,M), and 
evolve the joint state z := (0, p) according to Hamilton’s equations: 


e = 

p = 


dH 

dH 

~de 


M-ip 

-VeU{e) 


( 2 ) 


The resulting Hamiltonian dynamics are 1) time reversible, and 2) volume pre¬ 
serving. In practice, however, it is difficult to solve Hamiltonian’s equations 
analytically, so numerical methods such as, leapfrog (or Stormer-Verlet) |161I17] . 
to approximate these equations by discretizing time with small step size e. In 
the standard HMC algorithm, L, of these leapfrog steps, with some step size, e, 
are used to propose a new state, which is either accepted or rejected according 
to the Metropolis acceptance probability [One can refer to|3l for more details]. 


2.2. RHMC 

While HMC explores the parameter space more efficiently than random walk 
Metropolis (RWM), it does not fully exploit the geometric properties of the 
parameter space. In some complex scenarios, e.g. the concentrated nonlin¬ 
ear distribution in figure. HMC does not explore the parameter space as 
’straightforwardly’ as RHMC does. To take advantage of the Riemannian ge¬ 
ometry of statistical models, [ 5 ] proposed Riemannian Manifold HMC (RHMC) 
to improve the efficiency of the standard HMC by automatically adapting to 
the local structure of the parameter space. Following the argument of [H], they 
define Hamiltonian dynamics on the Riemannian manifold endowed with metric 
tensor G(0), usually set to the Fisher information matrix related to the un¬ 
derlying statistical model. As a result, the momentum vector for the resulting 
dynamic system becomes p\9 ~ A/'(0, G(0)). That is, the mass matrix G(0), is 
generalized to be position dependent. The Hamiltonian is defined as follows: 

H{9,p) = -log7r(0|X>)-tilogdetG(e)-f-ip'^G(0)"ip = (;i(0) + ^p'^G(0)~^p 

( 3 ) 
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Figure 1: Comparing Random Walk Metropolis (RWM), Hybrid Monte Carlo (HMC), Rie- 
mann Manifold Hamiltonian Monte Carlo (RHMC) and Lagrangian Monte Carlo (LMC) in 
sampling from a mixture model designed to have a nonlinear concentrated distribution, see 
[ 5 ]. Trajectory length is set to 1.5, the acceptance rates are 0.725, 0.9, 0.7, 0.8 respectively 
for the first 10 iterations. Blue dots are accepted proposals and red lines are sampling paths, 
with the thickness indicating the computational cost per iteration. It is clear that RWM 
cannot explore the support of the distribution efficiently while HMC does a much better job, 
however the global metric structure introduces inefficiencies in exploration, whilst the RHMC 
and LMC methods with their local geometric structure more effectively traverse the space. 


where we denote (j){9) := — log7r(0|I?) + ^ logdet G(0). 

The resulting Riemannian manifold Hamiltonian dynamics becomes non- 
separable since it contains products of 0 and p. As a result, the standard 
leapfrog method is neither time reversible nor volume preserving. To address 
this issue, [5] use the following generalized leapfrog method: 


p(^+i) = pM _ 
^(^+1) _ qW _j_ 
p(^+i) _ pC^+i) 


e 

2 

e 

2 1 












(4) 

(5) 

( 6 ) 


Here, the elements of the vector {v>{6,p))i = —p^di{G(9) ^)p. The above 
series of transformations are 1) time reversible, and 2) volume preserving [^. 


2.3. LMC 

The generalized leapfrog method used for RHMC involves two implicit equa¬ 
tions which require potentially time-consuming fixed-point iterations. In 

([^, it requires repeatedly inverting the mass matrix G(0), an opera¬ 

tion. To alleviate this problem in m the authors propose an explicit integrator 
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for RHMC by using the following Lagrangian dynamics: 
e = V 

V = -v'^T{e)v - Gie)-^^g(i){e) 


where the velocity v := G{6) ^ A/'(O,G(0) ^), and r(0) is the Christoffel 


Symbols of the second kind whose {i,j,k)-th. element is F*- = 
djgim — dmgij) with g^'^ being the (A:,m)-th element of G(0)“^. 
An explicit integrator can be obtained that is time reversible: 






-1 r 


rW 


^(^+ 1 ) _ gW _|_ 


V 


(f+i) 






^(^+3) _ |G(6>(^+^^)-iVe(/i(6»(^+^)) 


(7) 

( 8 ) 

)' 

(9) 


where := However, the integrator is not volume 

preserving thus the acceptance probability is adjusted to ensure the detailed 
balance condition holds (Denote z := (0,v)): 

= min exp(—+ E{z^^y) 
det(I-e/2r2(6>(^+i\v(^+i)))det(I-e/2S7(0(^\v(^+i/2))) ^ 

det(I + £/ 2 f 2 ( 6 >(^+^\ v(^+i/2))) det(I + e/2rj(0(^\ vW)) 
the Jacobian determinant, and E{z) is the energy for the Lagrangian dynamics 
defined as follows: 

EiO, p) = - log Tr{e\V) - i log det G(0) + iv'^G(0)v (11) 

The resulting algorithm, Lagrangian Monte Carlo (LMC), has the advantage of 
working with a fully explicit integrator, however the determinants of the trans¬ 
formations need to be computed and accounted for to correct for the volume 
compression of the Lagrangian dynamics [See 1191 for more details]. Indepen¬ 
dent developments in the molecular dynamics literature arrived at a similar 
compressive generalised hybrid Monte Carlo method |20]. 

In Figure[^ we can see the increasing capability of the methods in exploring a 
complicated parameter space as more geometric information is introduced. How¬ 
ever the associated computational cost increases significantly: RWM is 0(1), 
HMC is 0{D), and RHMC and LMC are 0{D^) (or 0(£)2.373) 
arithmetic algorithms). As discussed it is computationally intensive to directly 
apply these geometric Monte Carlo methods to Bayesian Inverse Problems be¬ 
cause of the demanding cost of the required geometric quantities. A previous 
study where first, second and third order adjoint methods were developed in m 
illustrates the computational scaling challenges inherent in these methods. The 
following section considers statistical emulation as a means of getting around 
this computational bottleneck. 


where 




dzW 
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3. Statistical Emulation 

Statistical emulation was developed as a computational method from the 
work on Design and Analysis of Computer Experiments (DACE) in the 1980’s 
[SUES]. It was introduced as a means of statistical approximation in the simu¬ 
lation of complex process models, with the application mainly to sensitivity and 
uncertainty analysis |23j . Complex models are developed in the sciences and 
engineering to simulate the behaviour of complex physical and natural systems 
and to reflect the scientific understanding of their mechanisms. Therefore, such 
mathematical models or computer programs are often referred as the simulator, 
denoted as y = /(x) with inputs x and outputs y. The standard techniques 
of sensitivity and uncertainty analysis described in [21] demand repeated model 
runs, each of which may be expensive to complete, rendering such methods po¬ 
tentially impractical in practice. Statistical emulators, on the other hand, built 
to efficiently approximate the complex simulators, are much faster to compute 
and find use in for example climate models [251 126] . In application to the above 
sampling methods, we can emulate the required geometric quantities at much 
lower cost instead of exactly calculating them. More specifically, we can predict 
them using a Gaussian Process conditioned on a set of chosen configurations, 

see [2711211111123121]. 

Gaussian Process (GP) priors are used extensively in statistics [21 130] and 
machine learning m- GPs have also become a popular choice in constructing 
emulators m Ell [21 [321 HSl 131 IMj • MGMG methods have been used alongside 
emulated models for inference EI1I21I31I3S], however, there is little in the 
literature for the use of GP emulation to speed up MGMG methods. As far as 
we are aware, [36] was the first to use GPs for emulating gradients for HMG, 
likewise m built emulators for MGMG based on a non-stationary Gaussian 
Process, and we develop these ideas much further in this paper. 

4. Gaussian Process Emulators 

Emulation is based on a set of carefully chosen input points, named design 
points. These points are chosen to represent the simulator, usually equally 
spaced, or uniformly positioned. There are methods such as maximin, Latin 
hypercube [31 [31 in the experimental design literature to choose such design 
points. In the sampling problem considered, we want design points to be better 
adapted to the posterior measure, that is, they should evenly scatter over the 
isocontours. In this section, we assume a priori such a design set on which a GP 
emulator is based. This assumption will be relaxed in the section that follows. 

First, we are interested in emulating U{0) as a function of 0 € using 
Gaussian processes. Following |35j . we assume the squared exponential correla¬ 
tion: 

u{-) ^ gv{pi{-),c{;-)) 

/r(.) = h(.)/ 3 , h(0):=[l,0T,(e2)Tj ^^2) 

C(., •) = C{e\ e^) := exp{-(0* - e^fdmg{pW - 0^)} 
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Given a set of n design points Ttc := {9^, ■ ■ ■ ,6"'}, and conditioned on functional 
outputs ux) := [/(De), we can predict U{9*) at a set of m new points £ := 
{0*^, • • • , denoted as ug: 

M* = Hg/3 + Cgx,Cai(us-Hs/3) (13) 

C* = Ce-Cgx,C^iCx,g 


where /3 is a g' vector (g = 1 + 2£)), := h(X)c) is an n x g matrix with z-th 

row h(6>* ) for i = 1, • • • ,n, Hg = h((£) is an m x g matrix, := C(2)e, S)e) is 
an n X n matrix with (i, j)-element C(0*, 0^) as above, Cg® = C(€, S)e) = Cj,g 
is an TO X n matrix, and Cg := C(€, €) is an to x to matrix. Note in general 
Cx) > 0, and in practice we add a small nugget > 0 to the diagonal of Cx) to 
ensure it is well conditioned [El- 

Given a weak prior for both (3 and cr^, p(/3, a^) oc cr we can integrate out 
/3 to obtain [301 [35] : 


Ug|ux), X, P ~ a'^C**) 

p** = Hg3 + Cgx,C-i(ux, - Hx,3) 

— 

+ (Hg - Cgx,CgiHx,)(HT CgiHx,)”i(Hg - Cgx,C-iHx,)^ 
3 = (HTc^1Hx,)-1hJC^ iux, =: Px,us 

(14) 

where Px) := and Bx) := HjC^j^Hx). p** is also the Best Linear 

Unbiased Predictor (BLUP) for ug conditioned on ux) |2I]- We can further 
integrate out cr^ to obtain a T-process [35] : 

ue\u^,p^Tn-gip**,^^C**) 

a^ = {n-q- 2)-\u^ - Hx,3)^C^'(us, - Hx,3) 

= {n-q- 2)“^u5Cg^[I - Hx)Px)]ux) =:{n-q- 2)“^uJ,Qx)Ux) 

(15) 

where Qxj — HxiPx)]- We have the following proposition. 

Proposition 4.1. 

Px,Hx,=I, Px,Cs=B^1hT, 

Qx)Hx) = 0, Qx)Cx) = I — Px)HJ,, Q® = Qx>; Q® > 0. 


^In the standard MCMC setting, we only need prediction at the current state, i.e. m = 1. 
But we still use symbol m for the convenience of parallelization. 
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Figure 2: GP Emulation with and without derivative information on design points. Leftmost; 
density contour estimated without derivative information conditioned on 20 design points 
(f/(0*)|u5;), |X)c| = 20); Second from left: density contour estimated without derivative in¬ 
formation conditioned on 40 design points |X)c| = 40); Second from right: true 

density contour (f/(0*)); Rightmost: density contour estimated with derivative information 
on 20 design points (U'(0*)|uxi, |X)c| = 20). 


Denote Jj(p := HgPx) + then we can write 


fi** = Lguj, 

C** = Cg - CgsQsCse + HgB-iHT - HgPsCse - (HgPsCsg)^ 

Ps,' 

Qz) 


= Ce-[He Cgs 
= Ce-[He Cev 


s 

0 

Hs Cs 


HT 


Hi 


-1 


Cse 


(17) 


In general p cannot be integrated out no matter what prior is given, therefore 
in practice it is usually fixed at the Maximum Likelihood Estimate (MLE) or 
estimated using an appropriate Monte Carlo method EH and more details in 
appendix Appendix A| . Other parametrizations p — (r is called correlation 
length), p = e~'^ are also used and sometimes preferred because there is no 
positivity restriction as for p in the optimization. In the end, we use p** to 
predict ug with the associated standard errors CTA/diag(C**). 


4-.1. Emulation with derivative information on design points 

The differential is a linear operator, therefore derivatives of a Gaussian pro¬ 
cess are still Gaussian processes given that the mean and covariance functions 
are differentiable. This enables us to take advantage of derivative information 
on design points, V17(S)e), when building the GP emulator, which in many cases 
has been shown empirically to improve prediction |35) . 


















Following the notations in [5^, we denote 


dug = V 0 C/(S)e) 


'dU(Sc)' 

ddi 


dUiSt) 

dSo 


nDxl 


U® 


Us 


dus 


n(l+_D)xl 


where 0 means Kronecker tensor product. Similarly, 


dHs = V 0 h(J)e), 


Hs = 


Hs 

dHs 


n(l + D)xg 


The differential operator is linear and exchangeable with expectation and cor¬ 
relation [15], therefore 


Cor 


E 


du{e^ 


dOl 


dU{e^) 

~W~ 


,u{e^) 




(18a) 


= <k<l + D) + 20il{l + D<k<q)] (18b) 

= J|C(0\0^) = -2pU0l-9i)Ci9\9n (18c) 


Cor 


dU{9^) dU{9^) 


ddl 


del 


52 

deidei 


C{9\9^) 


(18d) 


= [2pk5ki - ApkpM - dim - 0l)]Ci9\9^) (18e) 

Thus we denote Csiso ■= Cor(dus,us)nDxn, where indicates the first 
operand in the correlation operator is the derivative (order 1) of U on S)e, while 
means the second operand is U itself (order 0 derivative) on S)e. It has 
(ifc,j)-th element as in (18c|. Similarly C^i^i := Cor(dus,dus)ni>xnD with 
(ik,jl)-th element as in (18e) for i, j = 1, • • • ,n and k,l = 1, ■ ■ ■ , D. Finally 


Cs = 


CsOs*^ Cs^s^ 

Csis« Cjjigi 


n(l+D) xn(l+D) 


Now to make prediction based on the extended information that includes 
the derivatives, i.e. to predict ue|us, we use the same formulae as in 0 with 
subindices D replaced with 3, that is, us ^ us, Hs ^ Hsi Cs Cs 
and Cgs C^^ := [Cgo^o, Cg 02 )i]„xn(i+D) = [Cor(u£, Us), Cor(u£, dus)]- 
Figure[^shows that with the help of derivative information on 20 design points, 
CP emulation (rightmost) is greatly improved in density estimation compared 
with the one (leftmost) without derivative information, and almost recovers the 
true density. Even without derivative information, emulation based on 40 design 
points (second from left) also recovers the truth. This illustrates the tradeoff 
between using more information on a limited amount of design points and using 
more design points with a limited amount of additional information. 

The effect of gradient information at design points can be explained by the 
following proposition. 
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Proposition 4.2. Conditioned on the same design set Se, Mean Squared Pre¬ 
diction Error (MSPE) of a GP with derivative information is smaller than that 
of GP without derivative information: 

E[(C/(0*) - u{0*)\uvf] < E[(C/(r) - U{0*)\nsf] (19) 

Proof 1. See \Appendix B\ 

Similarly the effect of the number of design points on functional output can 
be explained by the following proposition [i5] . 

Proposition 4.3 (Benjamin Haaland, Vaibhav Maheshwari). For design sets 
'Dzi E'Dc 2, we have 

E[{Ui0*) - U{0*\Dz2)f] < E[{U{0*) - U{0*\Dei)f] (20) 


Remark 1. One can think of GP emulation as approximating an infinite di¬ 
mensional function U{0*) with a vector in finite dimensional space span(uj)) 
(or span(uj))j. The approximation error in a finite dimensional vector space 
span(u 2 ))f or span(u 2 ) 2 )) is always smaller than that in its subspace span(u 2 )) 
(or span(us)J/ 

Heuristically, proposition 1^.31 and proposition \4-3\ mean that the prediction 
will be more credible with more information incorporated. Such extra informa¬ 
tion comes from either derivatives in addition to function values at design points 
or function values at more design points. 

At the end of this subsection, we comment on the derivative information at 
design points, dug. It is not required for our development of methodology, yet 
in many scenarios it is not available, e.g. oil reservoir simulation. When this 
is the case, we relax our notation to let terms with tilde still denote quantities 
without such derivative information. In the following, we denote 


n = 


n, 

n(l + £»), 


dux) absent 
dug present 


4.2. Emulating gradients 

Now we want to predict the gradients of the potential at new points for the 
use in HMC based algorithms, i.e. to predict dugjug. The prediction formulae 
are similar as above ( [T7| with all the 6:^°^ in subindices replaced with to 
indicate emulating derivatives on new points (£. To be more specific, we write 
down the conditional mean and correlation: 

p** = LgiUg 

C** = Cgi - + HeiBgiRT, - HgiPgCgg, - (HgiPgCggJ 

( 21 ) 

where Hgi = dHg, := Cor(due, ug)mDxn- 

One can see from the left panel of figure]^ in this illustrative example of a 
banana shaped distribution, with more and more design points well positioned, 
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Figure 3: Emulation using GP. Left: emulated gradients/metrics approximating true gradi¬ 
ents/metrics; Right: emulations on a 5 X 5 grid mesh. 


the emulated gradients become closer and closer to the true gradients and with 
30 design points spread over the density contour, the emulated gradients are 
close approximations to the true gradients. We can modify proposition |4.3| to 
show the same effect of design size on emulation of gradients: 

Proposition 4.4. For design sets Dci C J)e 2 , we have 

E[{du{e*) - du{e*\Dc2)f] < E[{duie*) - du{e*\'Dei)f] 


4-3. Emulating Hessians 

We also need to emulate the Hessian matrix based on the extended infor¬ 
mation Uj). First, we need to vectorize the Hessian matrix U (0)]. De¬ 

note the vectorized Hessian matrices evaluated at the new points € as := 
V(8)V(8)u(£). The formulae for prediction remain the same as in pTj ) except that 
superindices 1 appearing in will be replaced with 2. Hg 2 = i(I^VLe)mD^xq- 
Cg 2 ^ := Cor(d^Ug, Uj))„£) 2 xfi. Note (ifc/, j)-th element of Cg 22)0 and {ikl, jm)- 
th element of Cg 22 ,i (duj) present) can be deduced from [55] similarly {i,j = 
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1, • • • , n and k,l,m = 1, ■ ■ ■ , D): 


Cor 


d^U{e*’‘) 

dOl^dO*^ 


,u{e^) 


52 


dei^dep 


c{0*\e^) 


= [-2pfc4/ + ^PkPiier - si){er - 9i)]c{e*\ e^) 


Cor 


d^u{e*^) du{e^) 
deader ^ d0L 


33 


dBpdepdei 


■c{9*\e^) 


-4.SklPmPk{dZ - Bjn) - ^5lynPkPl{dk - dl) 


- 'iSjnkPlPmidr - dj) 

+ SpkPiPm[et - 9i){er - 91){9» - 9i,)]c{e*\ e^) 

( 22 ) 



{mD^) 

dug irnD) 

Ug (to) 

Ux) (n) 

dux) {riD) 


Cg2 



Cg2X)0 

Cg2X)l 

dug (jnD) 


Cgi 


Cgix )0 

Cgix)i 

ug (m) 



O 

U 

Cgox )0 

Cgox)i 

Ux) (n) 

Cx)0g2 

Cx)0gi 

Cx)Ogo 

Cx)0x)0 

Cx)0x)i 

dux) {nD) 

Cx)lg2 

Cx)igi 

Cxjigo 

Cx)l £)0 

Cx)isi 


Table 1: Relationship between blocks of the whole correlation matrix. 

Table 0 may help better understand the relationship between blocks of 
the correlation matrix. To sum up, we denote Bx) := Hj C-iHx,, Px, := 
and Qx) := C^^(I —HxjPs). Then all the predictions can be cast 
as a linear mapping of the extended information at designed points, Uxj: 

E[ueo|ux)] = LgaUx), Lgo := Hg^Px) + Cg„gQx), a = 0,1,2 (23) 

Emulating Fisher information 

Last but not the least, the Fisher metric and Christoffel symbols require to 
be emulated for RHMC and LMC. The expected Fisher information involves 
the following expectation with respect to x, not 6 : 

FI(r|£>e)= y V2c/(x,r|De)exp(-17(x,r|2)e))dx (24) 

whose integrand is no longer a GP under assumption ( |12[ ) , thus direct emulation 
of the expected Fisher information using a GP is not readily available. 

We instead consider the empirical Fisher information, which can be a good 
estimate of expected Fisher information when there are sufficient data: 

eFI(6>*|De) = DU{e*\Dc)F)U{e*\'De)'^ - j^WU(e*\'Dc)VU{e*\T)zf 

= DU(e*\'St)[lN - lNiN/N]DU{9*\'Szf =: DU{9*\Dc)JN^U{9*\Dcf 

(25) 


12 























where N is the number of data, D[/(0*|J)e) is a. D x N matrix whose 
th element is j^U{xj,9*\Dc), and V?7(0*|®e) = D{7(0*|S)e)lAr is a I? vector 

whose i-th element is g|^C/(X,0*|2)e) = ]^U{x.j,9*\'Se). J^v := Iat — 

Now we focus on how to estimate Dt7(0*|De), instead of V17(0*|2)e). 
It is still impossible to estimate ^U{xj,9*\^t) with assumption ( |l^ , so 
we relax it to assume the same GP for U{xj, •) across different x^-’s: 


(26) 

Denote Uj) := 17(X, Se) as an n x TV matrix with (i,_))-th element U{xj, 9’') and 

dUj) = V (8)C/(X,iDe) as an nD x N matrix if available. Let U^) = [Uj,,(iUg] 
denote an n x N matrix. Similarly dUg = V ® t/(X, 0;) denotes an mD x N 
matrix. ^ 

Applying a similar a^ument to each column of U©, we obtain the linear 
prediction (23) for dUglUj): 

E[dUg|Us,] = LeiUs (27) 

To predict the empirical Fisher information on the evaluation set 2:, eFIg|Us), 
we substitute DC7(€|2)e) « LgiUs back in (25) to get the following mD x mD 
matri^lE] 


E[dUg|Us]JjvE[dUe|U£,] = LgiUsJ^UjLT, =: LgigFIj,L 


(£1 


(28) 


where we name gFIj, := UsJatUJ, as generalized (empirical) Fisher informa¬ 
tion. It is an h X h matrix that can be pre-calculated and stored. The left 
panel of Figure show the effect of different numbers of design points on the 
emulation of empirical Fisher information, and the right panel illustrates the 


exists for emulated Fisher information but we omit them here. 

Note gFIj, in the emulated Fisher information only consists of up to first 
order derivatives on design points. This avoids emulating third order derivatives 
d^U{9*), which otherwise is inevitable in the emulation of Christoffel symbols 
forming the third-order tensor appearing in manifold methods. Considering its 
symmetry, it seems more natural to work with the emulated Fisher information. 
Denote the (i, j)-th element of eFI(0*|J)e) as g*y We have 


emulated Fisher information on a grid mesh. Similar results as proposition 4.3 


9kk = D?fcC/D,C/T + 1 

+ 9tk,j - 9%,k) = D?,C/D,C/T - 


(V2,C/V,C/ + V.C/Vffct/) 

= D^^-DJjvDfcC/T 
(29) 


^To get eFIi^lUx) when m > 1, we could partition matrix ( |28| | to D x D cells with each cell 
an m X m matrix. For each cell, we only take the diagonal entries of such an m x m matrix 
and in this way we obtain m of D x D predicted empirical Fisher information matrices. 
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where := V 0 D, := V 0 V. By a similar argument, FglUs) can be 
emulated with Lg 2 gFl 2 ,Lgi, an mD^ x nD matrix. 

We then use the emulated gradients of potential energy in HMC, emu¬ 
lated Fisher information and Christoffel symbols in RHMC/LMC and con¬ 
duct the Metropolis acceptance test with exactly computed potential energies. 
We name the corresponding algorithms as Gaussian Process emulated (GPe) 
HMC/RHMC/LMC, denoted as GPeGMC. In this paper, we only aim to re¬ 
duce the computational cost of these expensive geometric quantities in HMG 
based algorithms. Note in large scale data problems {N ^ 1), GP emulation 
brings computation of gradients and metrics from 0{ND) down to 0{hD) for 
HMG, from 0{N{f + 3)13^) down to 0{n{f + GjD^) for RHMG (/ the num¬ 
ber of fixed point iterations) and from 0{4:ND^) down to 0{4nD^) for LMG. 
But at the end of each iteration, the acceptance test is done with the exact 
energy, which is 0{N). For large scale complicated models, the computational 
complexity depends on the simulator (e.g. related to mesh size of the solver in 
inverse problems), regardless of the n times model simulation required at design 
points, GPeGMG methods requires only one simulation at the acceptance test of 
each iteration, the same as RWM does. One can push further on computational 
economy by using the emulated potential energies in the Metropolis acceptance 
probability and quantifying the bias caused by such an inexact acceptance test 
[44l im |46], e.g. after a reasonable design set is obtained using the following 
adaption scheme, but it goes beyond the scope of this paper. 

5. Auto-Refinement: Online Design Pool Adaptation 

As mentioned previously, design points play a crucial role in emulation in 
general and with GP emulation in particular. First, the size of the pool of design 
points should be controlled for computational efficiency. Second, the quality of 
design points directly determines the precision of the associated emulator. An 
ideal set of design points should evenly spread over the density contours so 
that the GP conditioning on it models the target probability distribution well, 
however it is often challenging to do so at the beginning of such an analysis, 
especially for computation intensive models. Therefore, it is natural to think of 
adapting the initial design pool and the emulator while generating the Markov 
chain. 

However, doing such adaptation infinitely often will disturb the stationary 
distribution of the chain SZllll]. We refer to regeneration [n m m iH] to 
allow the online adaption to occur at certain ’regeneration times’. Informally, a 
regenerative process “restarts” probabilistically at a set of times, called regen¬ 
eration times |48j . When the chain regenerates, the transition mechanism can 
be modified based on the entire history of the chain up to that point without 
disturbing the consistency of MGMC estimators. 

Traditional space-filling algorithms like Latin-Hypercube, max-min etc. [381 
[39] do not generate a good design set in general because rather than in some reg¬ 
ular shape, the design space, not known a priori, should be characterized by the 
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target distribution. We therefore borrow the idea from sequential experimental 
design to adapt the design pool to the shape of a target density and refine the 
corresponding emulator. Starting from some initial design pool of small size, 
the experimental design method can grow the pool out of some candidate set to 
a desirable size according to some information criterion. Considering the setting 
of MCMC, it is natural to grow the design pool by selecting candidates from 
previous samples which retain partial knowledge of the geometry. On one hand, 
the MCMC sampler feeds the emulator with useful candidate points to refine 
with; on the other hand, the GP emulator returns geometric information (em¬ 
ulated gradients/metrics) efficiently for the MCMC sampler to further explore 
the parameter space. The two form a mutual learning system so that they can 
learn from each other and gradually improve each other. 

5.1. Identifying Regeneration Times 

The main idea behind finding regeneration times is to regard the transition 
kernel T{6t+i\6t) as a mixture of two kernels, Q and R I49j . 


m+i|0t) = s{et)Q{et+i) + (1 - s{et))R{9t+i\et) 

where Q{6t+i) is an independence kernel, and the residual kernel R{0t+i\0t) is 
defined as follows: 


Ti0t+i\0t) - Si0t)Q{0t+i) 


R{0t+,\0t)= 


I 


1-S(0t) 


Si0t)e[o,i) 
Si0t) = 1 


S{0t) is the mixing coefficient between the two kernels such that 


T{0t+i\0t) > S{0t)Q{0t+^),\f0t,0t+i 


(30) 


Now suppose that at iteration t, the current state is 0t. To implement 
this approach, we hrst generate 0 t+i according to the original transition kernel 
0t+i\0t ~ T{-\0t). Then, we sample from a Bernoulli distribution with 
probability 


r{0t,0t+i) 


Si0t)Q{0t+i) 

Ti0t+i\0t) 


(31) 


If Bt^i = 1, a regeneration has occurred, then we discard 0t+i and sample it 
from the independence kernel 0 t+i ~ Q{’)- At regeneration times, we refine 
design points and the associated emulator using the past sample path. 

Ideally, we would like to evaluate regeneration times in terms of the HMC 
style transition kernel. In general, however, this is quite difficult for such a 
Metropolis algorithm. On the other hand, regenerations are easily achieved 
for the independence sampler (i.e., the proposed state is independent from the 
current state) as long as the proposal distribution is close to the target distri¬ 
bution mi. Therefore, we can specify a hybrid sampler that consists of the 
original proposal distribution (GPeGMC) and the independence sampler, and 
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Figure 4: Approximations to the true density. Left: true posterior density; Middle; posterior 
density approximated by Gaussian Process with 30 design points; Right: approximate density 
given by mixture of Gaussians centered at the same design points. 


adapt both proposal distributions whenever a regeneration is obtained on an 
independence sampler step El- In our method, we systematically alternate 
between GPeGMC and the independence sampler while evaluating regeneration 
times based on the independence sampler only gSlIiS. 

More specifically, T{0t+i\Ot), S{9t) and Q{6t+i) are defined as follows to 


satisfy (30): 




S{6t) = min < 1 


TT{et)/q{et) 

Qidt+i) = (?( 0 t+i) min |l, 


(32) 

(33) 

(34) 


A natural choice for q{-) could be based on the emulated potential energy as we 
have seen in figure]^ that it approximates the target distribution very well: 


q(0*) oc exp(—E[[/(0*|(De)]) = exp h(0*)/3 — C(0*, S)) 7 | 


f ~T " 

D 

OC exp < —f3 9* — ^ 


1 



exp 



0^)^diag(p)(0* - } 


where 7 = Cj,^(u 33 — Hj)/3). However in general it is very difficult to directly 
sample from a distribution with such density. In practice, we find that it works 


16 












well to set g(-) to be a density comprised of, for example, a mixture of Gaus- 
sians centered at the design design points iDe with empirical Fisher matrices 
eFI(2)e) as the precision matrices and posterior probabilities exp(—17(iDe)) as 
their relative weights. Figure shows that with a reasonably good design, such 
a mixture of Gaussians could approximate the true posterior density reasonably 
well for practical purposes. 

5.2. Refining the design set 

When the Markov Chain regenerates, we refer to an experimental design 
method based on mutual information to refine the design set and update the 
associated GP emulator accordingly. Such a method sequentially selects design 
points from some candidate set formed by previous MCMC samples. A design 
point is chosen by optimizing the mutual information gain of adding a point to 
the design pool. 

In classical information theory, mutual information m is a standard mea¬ 
sure that has been successfully applied to sensor network design [SH [53] , exper¬ 
imental design |54j , and optimization |55j . Consider two random vectors U and 
C/'with marginal pdfs p[/(u) andp[//(u'), and joint pdf pjy;/'(u, u'). The mutual 
information between them, denoted by I{U] U') is equivalent to the Kullback- 
Leibler divergence T^ifL(’ll’) between puu' ^^nd PuPu' ^^nd linked to the entropy 

m- 

I{U-U') = Dkl{puu'\\puPu') = H{U) - H{U\U') (36) 

where the entropy in the Gaussian Process setting is as follows 

i/(C/(S)e)) oc ^ logdet Cx) (37) 

Given a current design Ttz and a candidate set ®cand, to choose the most 
informative subset from &cand to add to Se, | 33 | propose a greedy algorithm 
based on the following sequential optimization of mutual information: 

r = arg max /(C/(De U { 6 >}); C/(2)e^\{6/})) -/(t/(£>e); 17(De^)) 

O^&cand 

= arg rnax H{U{e)\u^) - H{U{9)\U{^z<^\{e})) ( 33 ) 

= arg max YaviU(6)\u^)/YariU(0)\U(Dc'^\{9})) 

0G0car.d 

where 2)eUS)e° = 0 forms the whole design space. Based on the same idea, m 
improves it and adopts it for computer experiments. Therefore, the algorithm 
is named Mutual Information for Computer Experiments (MICE) |15] and is 
summarized in algorithm 

We now adapt MICE [T5| for our specific purpose of refining the MCMC 
transition kernel. Starting with the current design set, we select a small number, 
k, of points from it as our initial design, and combine the rest of the design points 
with points collected from the previous regeneration tour, to form a candidate 
set. There are two advantages of using MCMC samples for candidates in MICE. 
One is that MCMC samples retain partial knowledge of the geometry of the 
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Initial Design 


Adapting... 


Final Design 


True density 



Figure 5: The evolution of the design pool by the MICE algorithm. Black circles are design 
points and red circles are new design points added to the design in each adaptation. All the 
density contours are produced by GP emulation based on the current design except the true 
one. 


target distribution therefore are more informative than some random points. 
The other is that, potential energies at these points, required in MICE, have 
already been calculated at the Metropolis test step and can be recycled. 

First, before running MICE, we can pre-process the candidate set to narrow 
it down to a smaller subset of points controlling their pairwise distance by a 
max-min method. This is recommended not only to save computation of the 
optimization (38), but also to avoid poor conditioning of the covariance in the 
GP. Second, for the sake of computational efficiency, instead of continuously 
growing the design size, we ’refresh’ the design set with a flexible size to be 
determined by the algorithm. More specifically, after pre-processing the candi¬ 
dates, we use MICE to select points one by one from the processed candidate 
set, and add it to the initial design set until this process is stopped by some 
criterion, e.g., MSPE falls below some threshold. Third, we do such adaptation 
in a more structured way. Regeneration is tested with some interval (e.g. ev¬ 
ery 20 iterations) because too short a regeneration tour will not provide many 
informative candidate points, and the adaptation is stopped when we reach a 
satisfactory (e.g. by testing MSPE on a set of randomly selected samples, or 
monitoring the entropy to reach its stationarity) design set. 

We summarize the adaptive GP emulated GMG (adpGPeGMC) in algorithm 

Figure shows how the design pool is evolved by the MIGE algorithm. Even 
starting with a bad design (crowed at the starting point), modified MICE can 
gradually spread them over the target density contour until it reaches the final 
design based on which Gaussian process accurately emulates the true probability 
distribution. 
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Figure 6: Banana-Biscuit-Doughnut distribution: generated with N = 100, fiy = l,cry = 2, 
and (Tg = 1 as in ||39|l. 


6. Experimental Evaluation 

In this section, we use two synthetic examples and one oil reservoir prob¬ 
lem to evaluate our GPeGMC, compared to the full version of these geometric 
Monte Garlo methods. We use a time-normalized effective sample size (ESS) to 
compare these methods For B posterior samples, the ESS for each param¬ 
eter is defined as ESS = B[1 + , where is the sum of K 

monotone sample autocorrelations [56] . We use the minimum ESS normalized 
by CPU time (s), min(ESS)/s, as the measure of sampling efhciency. We also 
use Relative Error of Mean (REM) and Relative Error of Covariance (REG) to 
measure how fast different algorithms reduce the error in estimating mean and 
covariance of parameters given the limited computation budget. They are de¬ 
fined as || 0 (t) —E 0 II 2 /IIE 0 II 2 and |js( 0 (t)) —Cov 0 || 2 /||Cov 0||2 respectively, where 
6 {t) means samples collected up to time t. All computer codes and data sets 
discussed in this paper are publicly available at http://warwick.ac.uk/slan 

6.1. Simulation: Banana-Biscuit-Doughnut distribution 

Let us first investigate the generalisation of the previously discussed 2d Ba¬ 
nana shaped distribution: 


[ 0/21 [ 0 / 2 ] 
y\9 ^ Af{fj.y,al), Hy := ^ 6 I 2 /C -1 + ^ 






(39) 


0, ~ Af(0,a2) 


The data {yn}n=i generated with chosen fiy, ay, and ag. If we consider D = 
4, then the posterior 9\{yn} looks like a banana in ( 1 , 2 ) dimension, a biscuit 
in (1,3) dimension and a doughnut in (2,4) dimension. Therefore we name the 
distribution as ’Banana-Biscuit-Doughnut (BBD)’ distribution as depicted in 
figure 1 ^ 
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This BBD distribution is complex as three different shapes are twisted 
around the origin. To make it more data intensive, we generate N = 3 x 10® 
data yn with yy = 0, ay = lO'^, and erg = 1. Now we apply all the geometric 
Monte Carlo methods to sample d\{yn} and compare their full versions with 
the GP emulated versions in terms of sampling efficiency in table Results are 
summarised for 10000 samples after burn-in. 


Algorithm 

AP 

s/iter 

ESS 

minESS/s 

spdup 

RWM 

0.68 

9.84E-03 

(3,4,8) 

0.03 

1.00 

HMG 

0.84 

8.28E-02 

(552,873,1102) 

0.67 

21.97 

GPeHMC 

0.78 

2.84E-02 

(622,653,754) 

2.19 

72.14 

RHMC 

- 

- 

- 

- 

- 

GPeRHMC 

0.76 

1.32E-01 

(319,560,756) 

0.24 

7.95 

LMC 

0.91 

2.21E-00 

(910,1120,1251) 

0.04 

1.36 

GPeLMC 

0.78 

2.24E-02 

(574,609,823) 

2.56 

84.36 

adpGPeLMC 

0.67 

2.38E-02 

(252,322,509) 

1.06 

34.90 


Table 2: Sampling Efficiency in the BBD distribution. AP is the acceptance probability, s/iter 
is the CPU time (second) for each iteration, ESS has (min., med., max.) and minESS/s is the 
time-normalized ESS. Spdup is the speed up of sampling efficiency measured by minESS/s 
with RWM as the baseline. 


In this example, it is extremely time consuming to scan 3 x 10® items for each 
gradient evaluation. Though the metric tensor, expected Fisher information, 
can be explicitly calculated, for illustrative purposes we use the empirical Fisher 
information which estimates the expected Fisher information with gradients as 
a means to compare in a fair way the emulated geometric Monte Carlo methods 
to their full versions. For each GPeGMC, 40 design points (samples) are chosen 
from a long run HMC using the MICE algorithm. AdpGPeLMC is the GP 
emulated LMG with online adaptation discussed in section 

The benefit of higher ESS by using the metric in LMG is completely off¬ 
set by the computational cost of geometric quantities, making LMG close to 
RWM in efficiency after normalizing total sampling time. RHMC is even more 
time consuming due to the repeated metric evaluation in the implicit steps 
of the generalised leapfrog integrator, so we exclude it from the comparison. 
Compared with those original geometric Monte Carlo methods, our proposed 
emulated methods in general yield less raw ESS due to the GP emulation, but 
at substantially lower computational cost, yielding much more efficient algo¬ 
rithms. However, due to the complexity of the distribution, online adaptation 
requires some time to ffnd an appropriate configuration that captures the geo¬ 
metric features, and as such we do not observe any advantage of adpGPeLMC 
over GPeLMG in this example, though adpGPeLMC is still much more efficient 
than LMC. 

Since GP emulated Monte Garlo methods generally take less time to gen¬ 
erate a sample, they are usually fast in reducing errors of point estimates, eg. 
mean and covariance, as shown in figure where analytical expression of the 
expected Fisher information is used. This is important in obtaining good esti- 
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Figure 7: Relative error of mean (left) and relative error of covariance (right) in sampling 
from BED distribution. 


mates quickly given a limited computational budget. 

6.2. Elliptic PDE 

Now we consider a canonical inverse problem involving inference of the diffu¬ 
sion coefficients in the following elliptic PDE [57| 06] defined on the unit square 
[ 0 , 1 ] 2 : 

Vx • (c(x, 0)Vxu(x, e)) = 0 
u(x,0)|a;2=o = Xi 

u(x,0)|a;2=i = 1 - ail (40) 

9u(x, 6 ) 9u(x, 6 ) ^ 

This PDE serves as a simple model of steady-state flow in aquifers and other 
subsurface systems; c can represent the permeability of a porous medium while 
u represents the hydraulic head. 

The observations {yi} arise from the solutions on a 11 x 11 grid contaminated 
by additive Gaussian error Sj ~ A/'(0, 0.1^): 


Vi = u{xi,e) + Si 
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Endow the diffusive field c(x) with a log-Gaussian process prior, we want to infer 
its posterior. This prior allows the field to be approximated by a Karhunen- 
Loeve (K-L) expansion [58]: 

/ D 

c(x, 9) « exp I ^ e^y^Cdi^) 

\d=l 

where Xd and Cd(x) are the eigenvalues and eigenfunctions of integral operator 
on [0,1]^ defined by the Gaussian kernel. The parameters 6 are endowed with 
independent Gaussian priors Oi ^ A/'(0,1). Here, the K-L expansion is truncated 
at £) = 6. 

The statistical model is relatively simple. However, each likelihood evalua¬ 
tion involves numerically solving a forward PDE based on a 20 x 20 regular mesh, 
which could be computationally intensive. Geometric MGMG methods require 
further derivatives of likelihood, which involve partial derivatives | |, which 
are even more expensive to calculate. We apply our emulated algorithms to this 
problem and compare the sampling efficiency of different algorithms. The result 
is summarised in table |3| 


Algorithm 

AP 

s/iter 

ESS 

minESS/s 

spdup 

RWM 

0.57 

2.70E-02 

(69,94,249) 

0.25 

1.00 

HMC 

0.76 

4.35E-01 

(3169,4357,5082) 

0.73 

2.87 

GPeHMG 

0.57 

1.56E-02 

(609,1328,2265) 

3.91 

15.38 

RHMC 

0.87 

2.22E-h00 

(5073,5802,6485) 

0.23 

0.90 

GPeRHMC 

0.65 

7.35E-02 

(614,1224,1457) 

0.83 

3.29 

LMC 

0.72 

3.92E-01 

(5170,5804,6214) 

1.32 

5.19 

GPeLMG 

0.64 

2.65E-02 

(774,1427,1754) 

2.92 

11.48 

adpGPeLMG 

0.94 

8.71E-02 

(3328,4058,4543) 

3.82 

15.03 


Table 3: Sampling Efficiency in Elliptic PDE. AP is the acceptance probability, s/iter is the 
CPU time (second) for each iteration, ESS has (min., med., max.) and minESS/s is the time- 
normalized ESS. Spdup is the speed up of sampling efficiency measured by minESS/s with 
RWM as the baseline. 


We run each algorithm for 15000 iterations and burn in the first 5000. We 
tune the step sizes so that they have an acceptance rate around 70%. For 
HMC, RHMC and LMC, since they require solving an elliptic PDE (40) for each 
integration step, we parallelize the computation. Again we observe a drop in 
the raw ESS when comparing emulated algorithms with the full versions, but an 
increase in efficiency due to the computational time cut by GP emulation. The 
pairwise posterior contours of the parameter 6 are included in figure]^ Although 
not completely like a Gaussian, the posterior distribution is not as complicated 
as the BED distribution. Therefore, the online adaption (in adpGPeLMG) can 
obtain a better configuration and exhibit further advantage compared to pre¬ 
fixing the design set (in GPeHMG and GPeLMG) by selecting design points 
from the result of HMC after a long run. We also compare different algorithms 
in error reducing speed in the following figure 
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Figure 8: Relative error of mean (left) and relative error of covariance (right) in Elliptic PDE 
problem. 


6.3. Teal South oil reservoir 


Teal South (Figure 10) is a small oil field located in the northern Gulf of 
Mexico with the information about the field in the public domain. The field 
has a single well through which oil, water and gas are produced, with monthly 
production rates available for each phase. Teal South has been the subject of 
a number of studies [SH IMl IMl [SD- We used a simple reservoir model with 
anllxllx5 grid created by |fi2j . with 9 unknown parameters to describe 
reservoir uncertainty. The 9 parameters are: kh (horizontal permeabilities) for 
each of the 5 layers of the field, ky/kh (vertical to horizontal permeability ratio), 
aquifer strength, rock compressibility and porosity. A set of PDFs describing 
conservation of mass for each of the phases (oil, water, gas), along with Darcy’s 
law are solved to simulate the field oil production rate (FOPR) for a period of 
1200 days starting from November 1996. Given the inputs (9 parameters), the 
outputs (FOPR as a time series) are generated by running tNavigator [63], an 
oil reservoir simulation package. 

In general, reservoir simulation is computationally demanding, with a single 
run of a reservoir simulation model taking anywhere from 15 minutes to several 
hours on high end workstations (see for example |64 ] ). Teal South runs much 
faster than this, as it is a simple model, but even with such a simple model 
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Figure 9: The pairwise posterior contours of the parameter 6 in inverse problem of a classic 
elliptic PDE. 



Figure 10: Teal South oil field 


we cannot afford the tens of thousands of function evaluations that would be 
needed for RWM. To start the inference process, we generated 1000 samples by 
stochastic optimisation algorithms (differential evolution |65j . particle swarm 
optimization |6fij . and a Bayesian optimization algorithm EH). Since param¬ 
eters are in different scales, we normalize these 1000 samples. The following 
inference is based on emulation with these 1000 design points. 

One advantage of GP emulated Monte Carlo is that Hamiltonian algorithms 
can still be applicable in the absence of derivative and metric information be¬ 
cause we can emulate them. Given weak priors, we sample from the posterior of 
the 9 parameters using RWM, GPeHMC (full versions of HMC are not available 
here due to the blackbox nature of the simulator codes), and compare their 
sampling efficiency in table 
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Algorithm 

AP 

s/iter 

ESS 

minESS/s 

spdup 

RWM 

0.77 

1.35E-03 

(43,81,116) 

3.20 

1.00 

GPeHMC 

0.89 

4.38E-03 

(5808,6023,6124) 

132.73 

41.49 


Table 4; Sampling Efficiency in Teal South oil reservoir problem. AP is the acceptance 
probability, s/iter is the CPU time (second) for each iteration, ESS has (min., med., max.) 
and minESS/s is the time-normalized ESS. Spdup is the speed up of sampling efficiency 
measured by minESS/s with RWM as the baseline. 
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Figure 11: left: Samples (thinned for 1 every 10 samples); right: log posterior (no burning or 
thinning) in Teal South oil reservoir problem 


Note, the emulated gradient apparently helps to improve the sampling effi¬ 
ciency, furthermore, given the approximate Gaussian form of the posterior there 
is no requirement for the higher order metric components of the manifold meth¬ 
ods. In figure [g the left panel compares the samples generated by different 
algorithms and the right panel compares their convergence rate. GPeHMC not 
only converges faster but also generates less auto-correlated samples than the 
others. 

7. Conclusion 

Geometric information of a probability distribution improves the efficiency 
of MCMC samplers in exploring the parameter space thus it can improve the 
mixing rate of the resulting Markov chain. In HMC, gradient information of the 
distribution guides the sampler to explore high density regions; in RHMC/LMC, 
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metric information further adapts the sampler to the local geometry of the dis¬ 
tribution. They suppress the random walk behaviour in the classical Metropolis 
algorithm however impose computational challenges by the requirement of these 
geometric objects (gradient, metric, etc). 

In this paper, we investigate emulation using Gaussian Processes as a cheaper 
alternative to exact calculation of geometric information, with the aim to reduce 
the computational cost of these geometric Monte Carlo methods. It is believed 
to be the first attempt to emulate higher order derivatives in MCMC samplers 
in order to improve their efficiency. Furthermore, we observe the importance 
of design points selection for Gaussian Process emulation and the necessity of 
refinement of the design set as it is impractical to initialise a good set. This 
is also one of the first attempts to introduce experiment design methods to 
emulation for MCMC. We refer to the regeneration technique to determine legal 
times when the adaptation is allowed based on the previous history. When the 
chain regenerates, the MICE algorithm is used to sequentially refine the design 
set and update the associated Gaussian Process emulator. 

Simulation studies and real problems have shown a substantial advantage 
of emulated geometric Monte Carlo methods over their original versions in this 
setting. There are many directions to further improve the proposed methods. 
One of them could be a flexible mechanism to determine proper design size 
depending on the dimension and complexity of the problem. It is natural to 
think that a 100 dimensional distribution that is nearly Gaussian may not need 
more design points than a 2d highly skew, ill-shaped distribution to provide 
decent emulation. Another direction could be the local predictor using a subset 
of neighbouring points instead of the whole set of design points. This is crucial 
to save computational cost as dimension grows. Computation of a large corre¬ 
lation matrix is involved in the design refinement step and Hierarchical matrix 
factorization [681169] may be a direction that can contribute. Lastly, it is nat¬ 
ural to parallelize the Markov chains using our method as Gaussian Processes 
can predict on multiple points simultaneously. This can be applied to parallel 
tempering [TDlinilZI]- 
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Appendix: Calculations and 

Proofs 


Appendix A. Maximum Likelihood Estimate of p 

The likelihood of p is as follows: 

L{p) (X (a2(p))-(”-«)/^|Cx,(p)ri/2|HTc^i(p)Hx,ri/2 (A.l) 

To find MLE of p, we want to use Trust Region Reflective algorithm to optimize 
the following function 

Kp) = -{n - 9)/21ogCT^(p) - l/21ogdetC2)(p) - 1/2 logdet B 2 ,(p) (A.2) 


And we need the gradients and Hessians [See also ST] : 
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Appendix B. Proof of proposition 4.2 


Since both C/(0*)|ux) and U{6*)\ux) are unbiased estimators for U{9*), 
MSPE’s are just covariance cr^C** as in ( [l7| pT] , We can write the correlation 
matrix conditioned on ux,, denoted C**, as follows 


C** = Cp - Hrt 
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Denote Schur complements of A and D as Sa := D - BA-^BT and So := 
A — respectively. According to block matrix inversion and Sherman- 

Morrison-Woodbury formula m, we have 
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where > 0 because Sa = Cx)ix) 
Cor(ux)i |ux)o) > 0. 
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Appendix C. Algorithms 
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Algorithm 1 Mutual Information for Computer Experiments (MICE) 

Given U{9), 0, QV{h{-),C{-,-; p)), and initial design (3e, u^) 

Step 1. MLE to obtain estimates p in C(-, •; p)) 

Step 2. Eit CP model to (Se, u®) 

Step 3. Generate S)e'^ with repect to S)e and then choose ®cand{Q Se'^) 

Step 4. Solve 6* = argmaxeg®^,^^^ Var([/(0; p, ^^)|us))/Var(t/(0; p, i/^)|C/(2)e^\{6})) 
Step 5. Evaluate U{6*) = {U{6*),dU{9*)) and set iDe = S)e U {0*},U2) = 

U2, U17(6>*) 

Step 6. If the design 3e has reached the desired size, then stop; otherwise 
go to step 1 

Output: Design IDe and a GP emulator built on it 


Algorithm 2 Adaptive Gaussian Process emulated Geometric Monte Garlo 
(adpGPeGMG) 


Given current design (Se, ug), initialize GP emulator ^7^(h(-), C(-, sp)) 
for n = 1 to Nsamp do 

1st kernel: Sample 9 as the current state by GPeGMG. 

2nd kernel: Fit a mixture of Gaussians q{-) with current design 
(SCjUxjjgFIj,). Propose 9* ~ q{-) and accept it with probability a = 


’r(e)/<?(e) / 


if 9* accepted then 

Determine if 9* is a regeneration using ( [^ with 9t = 9 and 9t+i = 9*. 
if Regeneration occurs then 

Adapt current design through MIGE algorithm and refine the asso¬ 
ciated GP emulator. 

Discard 9*, sample ^ Q(-) as in ( [M| ) using rejection sampling, 

else 


Set = 9*. 


end if 
else 


Set = 9. 


end if 
end for 
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